Reorganization of Databases by Sectioning

ABSTRACT

Technologies are described for a system and method for reorganizing a tablespace in a database such that rows of the tablespace are arranged in a sequence defined in a balanced tree-type clustering index of the tablespace. The method includes sectioning the clustering index and the tablespace into sections including logically distinct sets of data by reading only tree pages of the clustering index to determine logical divisions. The method further includes allocating an amount of output space on a storage device for each section of the tablespace and of the clustering index, to provide for each section a first range of storage space for an output clustering index for the section, and a second range of storage space for an output tablespace for the section. The method further includes scheduling a reorg task for each section, and executing, by at least one processor, the scheduled reorg tasks on the sections.

RELATED APPLICATIONS

This disclosure is a continuation of U.S. Non-provisional applicationSer. No. 16/361,541, filed Mar. 22, 2019, which is incorporated byreference in its entirety.

BACKGROUND

The present invention relates to reorganization of databases, and inparticular, using a sectioning technique to reorganize large DB2databases.

Description of the Related Art

Tablespaces in databases are now becoming so large that conventionalreorg techniques will not work. Whereas the size of these tablespaceswas previously measured in megabytes (M), then in gigabytes (G), atablespace may now be of a much larger scale, containing terabytes oreven petabytes of data. The normal reorg technique is to read the data,sort it by the clustering index key value, then write the Tablespace andClustering Index from the sorted output. Regardless of the advancednature of modern direct-access storage device (DASD) devices, thisprocess becomes impossible or, at least, totally impracticable with suchlarge tablespaces. Some techniques exist to reorg a tablespace withoutsorting, but even these fail with the enormous sizes of the data.

SUMMARY

Technologies are described for a method of reorganizing a tablespace ina database such that rows of the tablespace are arranged in a sequencedefined in a balanced tree-type clustering index of the tablespace. Themethod includes sectioning, by a processor, the clustering index and thetablespace into sections including logically distinct sets of data byreading only tree pages of the clustering index to determine logicaldivisions. The method further includes allocating, by a processor, anamount of output space on a storage device for each section of thetablespace and of the clustering index, to provide for each section afirst range of storage space for an output clustering index for thesection, and a second range of storage space for an output tablespacefor the section. The method also includes scheduling, by a processor, areorg task for each section, and executing, by at least one processor,the scheduled reorg tasks on the sections.

In another aspect, a system includes a storage facility storing atablespace of a database and a balanced tree-type clustering index ofthe tablespace, the storage facility connected to a channel andconfigured to transfer data over the channel; and a computer includingat least one processor, a memory, and communication circuitry connectedto the channel and configured to transfer data over the channel. Thecomputer is configured to reorganize the tablespace such that rows ofthe tablespace are arranged in a sequence defined in the clusteringindex of the tablespace by executing instructions that cause the atleast one processor to: section the clustering index and the tablespaceinto sections comprising logically distinct sets of data by reading onlytree pages of the clustering index to determine logical divisions;allocate an amount of output space on the storage facility for eachsection of the tablespace and of the clustering index, to provide foreach section a first range of storage space for an output clusteringindex for the section, and a second range of storage space for an outputtablespace for the section; schedule a reorg task for each section; andexecute the scheduled reorg tasks on the sections.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an example of a balanced tree structure such as is used inan index of the DB2 database management system.

FIG. 2 illustrates an example search for a table entry in a balancedtree index structure such as is shown in FIG. 1.

FIG. 3 shows a sequential representation of an index file in the DB2database management system.

FIG. 4 shows a high-level overview of a reorg operation, in accordancewith a disclosed embodiment.

FIG. 5 shows a process for assigning or defining sections using a “newkey” bitmap, in accordance with a disclosed embodiment.

FIG. 6 illustrates an example of assigning new key bitmap values,according to the process shown in FIG. 5.

FIG. 7 shows a process of section scheduling, and of handling outputtablespace and index sections that do not fit into their pre-allocatedoutput space, in accordance with a disclosed embodiment.

FIG. 8 shows an illustrative computing environment in which the reorgprocess of the various embodiments may be used.

DETAILED DESCRIPTION

In the following detailed description, reference is made to theaccompanying drawings that form a part hereof. In the drawings, similarsymbols typically identify similar components, unless context dictatesotherwise. The illustrative embodiments described in the detaileddescription, drawings, and claims are not meant to be limiting. Otherembodiments may be utilized, and other changes may be made, withoutdeparting from the spirit or scope of the subject matter presentedherein. It will be readily understood that the aspects of the presentdisclosure, as generally described herein, and illustrated in thefigures, can be arranged, substituted, combined, separated, and designedin a wide variety of different configurations, all of which areexplicitly contemplated herein.

Disclosed herein is a method of reorganizing a tablespace in a databasesuch that rows of the tablespace are arranged in a sequence defined in abalanced tree-type clustering index of the tablespace. The methodincludes sectioning, by a processor, the clustering index and thetablespace into sections including logically distinct sets of data byreading only tree pages of the clustering index to determine logicaldivisions. The method further includes allocating, by a processor, anamount of output space on a storage device for each section of thetablespace and of the clustering index, to provide for each section afirst range of storage space for an output clustering index for thesection, and a second range of storage space for an output tablespacefor the section. The method also includes scheduling, by a processor, areorg task for each section, and executing, by at least one processor,the scheduled reorg tasks on the sections.

In another aspect, a system includes a storage facility storing atablespace of a database and a balanced tree-type clustering index ofthe tablespace, the storage facility connected to a channel andconfigured to transfer data over the channel; and a computer includingat least one processor, a memory, and communication circuitry connectedto the channel and configured to transfer data over the channel. Thecomputer is configured to reorganize the tablespace such that rows ofthe tablespace are arranged in a sequence defined in the clusteringindex of the tablespace by executing instructions that cause the atleast one processor to: section the clustering index and the tablespaceinto sections comprising logically distinct sets of data by reading onlytree pages of the clustering index to determine logical divisions;allocate an amount of output space on the storage facility for eachsection of the tablespace and of the clustering index, to provide foreach section a first range of storage space for an output clusteringindex for the section, and a second range of storage space for an outputtablespace for the section; schedule a reorg task for each section; andexecute the scheduled reorg tasks on the sections.

As will be discussed below, a database reorg process according to thedisclosure may be able to efficiently split reorg processing intoindependent sections, and schedule the sections for concurrentprocessing. Such a reorg process may, therefore, improve the functioningof the computing environment in which it is used during a database reorgprocess.

Embodiments in this disclosure will generally be described in terms ofstructures and operations provided in the DB2 database managementsystem, by International Business Machines Corporation of Armonk, NewYork. The DB2 database management system generally runs on the Z/OSoperating system, also by International Business Machines Corporation ofArmonk, New York. It will be understood that embodiments could be usedwith other database management systems and/or other operating systems,provided that such database management systems include structuressimilar to those described herein.

Terminology used herein may be specific to the DB2 database managementsystem and/or the Z/OS operating system. It will be understood thatsimilar structures that are described using different terminology couldbe used in accordance with various embodiments. For clarity, a number ofterms used herein are described below, so that readers who areunfamiliar with terminology associated with the DB2 database managementsystem can more readily comprehend the structures and/or functionsdescribed herein.

As used herein, a “tablespace” is a dataset containing logicallyconstructed user-defined collections of data known as “rows.” Within theZ/OS operating, such a “dataset” is an object that defines a collectionof data on an external device, such as a direct-access storage device(DASD).

As used herein, an “index” is a logical structure of user-defined keyvalues that indirectly reference the tablespace rows. In the DB2database management system, such indexes can be “clustering indexes” or“non-Clustering indexes.” A “clustering index” is the index defined tocontrol the physical ordering of the tablespace rows. That is, thedatabase management system attempts to keep the rows in the same orderas the corresponding keys in the clustering index. This provides a morelinear access path to data, which may improve the performance of queryoperations. Additionally, because rows with similar clustering index keyvalues are stored together, sequential detection prefetching is moreefficient when clustering indexes are used. Generally, each tablespacecan have only one clustering index, since the order of the rows cancorrespond to the order of the keys in only a single index.

For a “non-clustering index” the database management system does notmaintain any order of the data in the tablespace when compared to theorder of the corresponding keys of the index. A tablespace may havenumerous non-clustering indexes.

For example, suppose that there is a tablespace containing rows, whereineach row describes an individual sale of an automobile. There could benumerous indexes associated with this tablespace, such as an index whosekey is the seller of the car, an index whose key is the buyer of thecar, and an index whose key is the model of the car. For example, theindex for the model of the car would permit referencing all AUDI cars orall CADILLAC cars without needing to read the entire tablespace.

If the index whose key is the model of the car is the clustering index,then the rows in the tablespace would be physically arranged in thisalphabetical order, so all the rows for AUDI cars would follow the rowsfor ASTIN MARTIN cars, and precede the rows for BMW cars.

When subjected to processing, the rows in the tablespace might becomedisordered, so that there would no longer be this strict arrangement orordering of rows. The process of re-arranging the tablespace rows intotheir clustering index sequence order is known as a “reorg.”

As used herein, a “section” is defined as a consecutive set of indexkeys of a clustering index and all the associated tablespace rows. Itmay also refer to a collection of pages of any type of index if theprocess is REORG INDEX operation in the DB2 database management system.The result of a reorg is that the sections of the clustering index wouldoccupy a contiguous portion of the output reorged clustering indexdataset, and the associated section of the tablespace would occupy acontiguous portion of the output reorged tablespace.

The DB2 database management system supports “partitioned tablespaces,”wherein the database management system sections sets of tablespace rowsinto different datasets based on a key value derived from the row knownas the “partitioning key.” There is no requirement that there be auser-defined index using this partitioning key. In the example above,the partitioning key could be the automobile VIN number in user-definedranges of VIN values.

To further assist in understanding the various embodiments describedherein, examples of structures of indexes in the DB2 database managementsystem are illustrated and described below.

FIG. 1 shows an example of a balanced tree structure 100, such as usedin an index in the DB2 database management system. A root page 102 is atthe top of the tree and contains entries 104 pointing to illustrativetree pages 106 a and 106 b. Each of the tree pages 106 a and 106 bcontains a plurality of entries 108 that point to various leaf pages 110a-110 c. Each leaf page 110 a-110 c contains one or more entries 112that directly point to rows 114 a-114 c in the tablespace 120 with whichthe index is associated.

In the DB2 database management system, the entries in each individualleaf page 110 a-110 c are automatically kept in logical order by thesystem. Similarly, the entries in each of the tree pages 106 a and 106 band the root page 102 are also automatically kept ordered in logicalorder within the page itself.

The leaf pages 110 a-110 c may be referred to as being at “level 1” ofthe balanced tree structure 100. The tree pages 106 a and 106 b are onelevel above the leaf pages, and are therefore referred to as being at“level 2” of the balanced tree structure 100, and the root page 102 isreferred to as being at “level 3” of the balanced tree structure.Although this small example has only three levels of pages, in a largedatabase, there may be many additional levels of tree pages, so that theroot page of such a balanced tree structure will be at a “level” in thestructure higher than 2. Additionally, it will be understood that theleaf pages, tree pages, and root page may contain many more entries thanare shown in this example.

As shown in FIG. 1, the root page 102 contains a series of entries 150,wherein each entry 150 includes a page number 152 and a highest keyvalue 154 for the page that is pointed to by the page number 152. Thetree pages 106 a and 106 b also include a series of entries 156, whereineach entry 156 includes a page number 158 and a highest key value 160for the page (in this case, a leaf page) that is pointed to by the pagenumber 158. Finally, each of the leaf pages 110 a-110 c includes aseries of entries 162, each of which includes a pointer 164 to a row 114a-114 c in the tablespace 120, and a key value 166 for the row that ispointed to by the pointer 164.

It will be understood that this is a simplified view, intended only toprovide an overview of a balanced tree index structure for use inunderstanding the embodiments of the disclosure. In an actual index usedin the DB2 database management system, there may be additionalinformation, such as various header information that is present in theroot page, tree pages, and leaf pages. Additionally, there may manydetails, such as the use of “null” as the highest key value, or theexistence of variations on the structure of the leaf nodes, which, whileuseful for a DB2 database management system developer to understand, arenot necessary for purposes of understanding the embodiments disclosedherein. Additionally, in a DB2 index, the tree page entries include thefirst key of the next logical lower level page, rather than the highestkey value. This has the same effect as keeping the highest key value, asdescribed above, but may make inserting easier.

Further, it will be understood that if the index is sufficiently small,no tree pages are actually present, and the root page can actually be aleaf page. However, this is an unusual condition, and is extremelyunlikely to be the case for the large databases with which the variousembodiments disclosed herein are intended to be used.

A more conventional structure, as indicated before, has multiple layersof tree pages between the root page and the leaf pages. If theparticular structure of the index file had three levels, then the rootpage would indicate the appropriate number of references to specifictree pages, which tree pages would then also contain entries to anotherlevel of tree pages, which tree pages would then finally point to theleaf pages. It is also noted that for the examples in this specificationshow only a very limited number of entries in each page for simplicity.It is understood that in actual use a root or tree page will contain upto thousands of entries, depending on the length of the key value, whilethe leaf pages may contain hundreds of entries, though normally a treepage will have approximately 100 entries and a leaf page will haveapproximately 70 entries.

It should be noted that in a typical index, the vast majority of pagesare leaf pages. The number of tree pages as a fraction of the totalnumber of pages in an index is normally around 0.5%. This number willvary based on many factors, including key length and the distribution ofkeys (specifically, how similar is the next key to the prior), sincekeys in the tree pages are normally forward compressed, so that only thevalues different at the end of a key are saved. Assuming 0.5% of allpages in an index are tree pages, a very large index of 1 T has 256Mpages, of which about 1.25M will be tree pages, and about 255M will beleaf pages.

FIG. 2 illustrates a search for a table entry in a balanced tree indexstructure such as is shown above. In particular, a search through anindex in the DB2 database management system is illustrated.

For this example, a search for “E” is being conducted in the index 200on a single character. The search would start at root page 202, and lookat the highest key value for each of the pages that are pointed to bythe root page 202. First, the search will determine whether “E” is equalto or before “C”, which is the highest key value for the first tree pagepointed to by the root page. Since this is not the case, the search willproceed to check against the highest key value for the next page. Thus,the search will check whether “E” is equal to or before “J”. Since thisis the case, the search through the pages pointed to by the root page202 can stop, and the search can follow the pointer to the tree page204.

The search can then proceed to check against each of the highest keys inthe leaf pages that are pointed to by the tree page 204. This process isessentially the same as the previous. Since “E” is not equal to orbefore “D”, the search will proceed to check against the next entry inthe tree page. Because “E” is equal to or before “F”, the pointer whichpoints to the leaf page in which the highest key value is “F” will befollowed to leaf page 206.

Once the correct leaf page 206 has been identified, the search willlocate the key value “E” in the leaf page, and obtain the associated RID(record identifier). The RID provides the location (page and slotnumber) of the row in the tablespace that is indexed by the index 200that contains the particular record.

In this example, it required only four page reads to get to a specificrow. In general, the number of page reads will be proportional to alogarithmic function of the number of records indexed. This isconsiderably more efficient than, e.g., searching linearly through therecords.

FIG. 3 illustrates a sequential representation of an index file in theDB2 database management system. An index file in the DB2 databasemanagement system on the Z/OS operating system is typically a virtualstorage access method (VSAM) data set in a linear data set (LDS) formatwhich is essentially a byte-stream data set format.

The index file 300, has a first page 302, which is referred to as page0, and which serves as a header page. This header page containsparticular information needed to indicate to the various packages andprograms, and particularly to the operating system of the machine, thetype of file and other related information. The second page 304 of theindex file 300 is a space map, which indicates which particularsubsequent pages in the file are currently active for that particularindex file. These subsequent pages are either tree pages (including theroot page) or leaf pages. The third page 306, referred to as page 2, isthe root page R. The remaining pages 308 in the index file 300 are tree,leaf, empty, and unused pages, in whatever order has developed throughthe use of the indexing process.

As discussed above, tablespaces are now becoming so large thatconventional reorg techniques will not work. Whereas the size of thesetablespaces was previously measured in megabytes (M), then in gigabytes(G), a tablespace may now be of a much larger scale, containingterabytes or even petabytes of data. The normal reorg technique is toread the data, sort it by the clustering index key value, then write theTablespace and Clustering Index from the sorted output. Regardless ofthe advanced nature of modern DASD devices, this process becomesimpossible or, at least, totally impracticable with such largetablespaces. Some techniques exist to reorg a tablespace withoutsorting, but even these fail with the enormous sizes of the data.

According to embodiments of the disclosure, a reorg operation can beaccomplished on a large index or on a large index and its associatedtablespace, by using a reorg with the following characteristics: objectsdivided into independent sections; single read of each index; singlewrite of each index; single read of the tablespace; single write of thetablespace; parallel read and write of all objects in multiple tasks ofa single processor or in multiple tasks of multiple processors connectedin a SYSPLEX or other clustered, distributed, or parallel computingarrangement; no call to an external sort or sort product; and allobjects remain in read/write mode throughout the process.

FIG. 4 shows a high-level overview of such a reorg operation 400. At402, the clustering index and the tablespace are broken up intologically distinct sets of data, referred to herein as “sections.”Assigning sections may be done by reading the clustering index inlogical order (i.e. key sequence) to determine logical divisions. Usingcurrent technology, this process might take several hours for a verylarge clustering index (e.g. 2 T), but could take much longer for aclustering index in a highly disorganized state.

In accordance with various embodiments of the disclosure, reasonablyconsistently-sized sections can be assigned by reading only the treepages of the clustering index. Since, as discussed above, the tree pagesmake up only about 0.5% of the pages in the index (the rest being leafpages), the number of index pages that are read for assigning sectionscan be kept relatively low (at least in comparison with the total sizeof the index), greatly reducing the amount of time that is needed forthis step.

At 404, once the sections have been assigned, the reorg operationestimates, with as high a degree of accuracy as it is able to achieve,the amount of output space that will be needed for each section of thetablespace and of the clustering index. When using the Z/OS operatingsystem, the amount of output space may be DASD space, expressed, e.g.,in cylinders.

Once the amount of space has been estimated, a section can be viewed asa defined part of the clustering index-tablespace that includes alogical set of index leaf pages, a range of storage space (e.g., a rangeof cylinders) for the output index, and a range of storage space (e.g.,a range of cylinders) for the output tablespace. Thus, each section isself-contained, with the exception of space map pages that cover pagesthat range across sections and the high-level tree pages of the index.

Either as part of the estimation process or as a separate process, theestimated space can be allocated on a storage device.

At 406, the reorg of each section can be scheduled, and carried outindependently across multiple processors or computers in a clustered,parallel, or distributed system, such as a SYSPLEX using the Z/OSoperating system. Because the sections are assigned and provided withstorage space so that the sections are independent, they can beprocessed in parallel. Parallel processing of the sections may achievenear-linear speedup.

As will be described in greater detail hereinbelow, there may beinstances in which a section does not fit within the storage space thathas been allocated for it based on the estimation discussed above. Insome embodiments, processing of sections may be scheduled to reducefragmentation in the cases where the pre-assigned output storage rangesare insufficient to hold the entire section.

It should be noted that memory in a mainframe is a constant limitation,that also affects the parallel or concurrent scheduling of sectionsduring a reorg operation. To control the memory used by each process, areorg according to some embodiments of the disclosure may use “packets.”A packet is a fixed, non-logically consistent but contiguous set of keysand rows that is a subset of a section. The packet can be seen as theunit of processing in the reorg operation, and the number of packets inactive use is the number of concurrent tasks processing the index andtablespace. The size of a packet is designed to fit within a reasonableamount of memory yet produce enough data to substantially advance thework of finishing the reorg. The packet may also be used as a unit ofmeasurement to display the progress of the work to the user.

In accordance with various embodiments, packets are a mechanism forincreasing the parallel nature of the processing. As shown at thesub-process 408, numerous packets may be scheduled to run concurrentlywithin each section. Although these packets processing within a singlesection will generally be scheduled to finish serially, by that time,the bulk of the computing will have been completed. In some embodiments,a packet task manager process will limit the number of concurrentpackets to restrain memory usages, but as many as seven to eight packetsmay be active per section. While about four or five packets are doingthe serial work related to writing, additional packets can be doing theinput and row processing work, which, in many cases, will besubstantial. This includes possible decompression, compression, discardprocessing, and row rebuilding due to added or dropped columns. Thus,packets are a way to both increase the parallelism, as well as toprovide for greater overlap between reading and writing.

Generally, the number of packets concurrently being processed is limitedby the amount of CPU memory allocated to the reorg process. Usingpackets, the reorg operation is able to provide a process that workswithin a user-defined limitation of memory, while still being able toreorg an enormous object.

At 410, in a “wrap-up” process, all the separate sections of thetablespace and index are “knitted” together to form a final version ofthe newly reorged tablespace and index. This generally involvesresolving space map pages for the tablespace and clustering index, andwriting high-level tree pages in the index.

FIG. 5 shows a more detailed view of a process 500 for assigning ordefining sections. In accordance with various embodiments, this may beaccomplished using a “new key” bitmap.

At 502, the balanced tree that forms the clustering index is traversedin logical order. If this tree traversal were to proceed to the level ofthe leaf pages, then the leaf pages would be read in perfect indexorder. Algorithms for traversing trees in this manner are well-known. Inaccordance with various embodiments, however, it is not necessary toread the leaf pages to form the new key bitmap. The new key bitmap maybe build using information from “level 2” index pages—i.e., tree pagesthat are one level above the leaf pages in the balanced tree. This meansthat it is only the tree pages that are traversed. Because there aremany fewer tree pages than leaf pages (as explained above, typicallyonly about 0.5% of the pages in an index are tree pages), traversingthis tree and building the new key bitmap may be much faster thantraversing the entire tree, including the leaf pages.

At 504, at each level 2 tree page, one bit is allocated in the new keybitmap for each leaf page that is pointed to by the level 2 tree page.The bits are ordered according to the order of the leaf pages in theindex logical sequence.

At 506, the bits allocated at 504 are assigned values, such that a bitis turned on if it is known for certain that a new key starts in theleaf page corresponding to that bit. The bit is left off if either it isknown that there is NOT a new key or if the decision is uncertain (i.e.there is not sufficient information to know one way or the other).

Assigning these bit values is done using information present in thelevel 2 tree page. Each level 2 tree page includes the indexed leaf pagenumber and the value of the first key in the logically next leaf pagefor each leaf page pointed to by the level 2 tree page. There is also anindicator flag marking whether or not that key is the same as the lastkey in the referenced leaf page (referred to herein as the “keycontinuation flag”).

Using this information, a bit in the bitmap corresponding to a leaf pagecan be turned on if the leaf page is the first leaf page in the index,and, in the case of other leaf pages, if the key continuation flag ofthe previous leaf page is not set (i.e., the first key in the currentleaf page is not the same as the last key of the previous leaf page).The bit in the bitmap corresponding to the last leaf page in the indexshould also be turned on. Otherwise, the bit should be left off. At theend of every level 2 tree page is a single page number without a valuefor the first key in the logically next leaf page, and without keycontinuation flag. Therefore, the bit corresponding to the next logicalleaf page—i.e., the first page of the next level 2 tree page—should notbe turned on.

It will be understood that in some embodiments, bitmap allocation (i.e.,as in 504) and assignment of bit values (i.e., as in 506) may not beseparate, so bits in the new key bitmap are simply assigned a value asthey are allocated. These are shown separately here only for purposes ofillustration.

For clarity of understanding, referring briefly to FIG. 6, an example ofassigning new key bitmap values is described. The leaf pages pointed toby the first level 2 tree node in an index are shown in row 602, whilethe values of the first key in the logically next leaf page for eachleaf page, along with the key continuation flag (indicated by a “+” ifthe flag is set) for each leaf page are shown in row 604.

As shown, it can be determined that the first key in page 4 is “B” andthat the first key in pages 5 and 6 is “C”. The last key in page 4 is“C” (“C+” in page 4 means the first key in page 5 is “C” and it is acontinuation from the last key in page 4) and all the keys in page 5 are“C”. It is not known whether there any other keys in page 6 except “C”;all the keys in page 6 might be “C” or there may be additional keysbetween “C” and “F”.

Therefore, we know there is a new key in page 3, since it is the firstleaf page in the index—so the bit corresponding to page 3 in the new keybitmap shown in row 606 can be turned on. There is also a new key inpage 4, since the first key is “B” and it is not a continuation frompage 3, so the bit corresponding to page 4 can be turned on. There isnot a new key in page 5, and page 6 is unknown, so the bitscorresponding to pages 5 and 6 cannot be turned on. The bitcorresponding to page 7 will be turned on, because the “F” is not acontinuation from page 6 (there is no “+” indicator on the “F”). Thus,the new key bitmap for this level 2 tree page is “11001”, as shown inrow 606.

Now referring back to FIG. 5, the new key bitmap can be used to assignor define sections. At 508, this involves identifying possible start andend leaf pages for sections. A section should begin with leaf pagemarked with a “1” (or “on”) in the new key bitmap, and end with a leafpage marked with a “1” (or “on”) in the new key bitmap. This means thatat a later time, when the leaf pages are read, the section will start atthe beginning of the first new key (we know there is one) in the leafpage at the start of the section, and end with all the keys before thefirst new key in the ending leaf page of the section. Subsequentsections begin on the same page on which the prior section ends.

At 510, the start and end leaf pages of a section are selected from thepossible start and end leaf pages by attempting to find a section inwhich the size of the tablespace section (i.e., the size of thetablespace rows referenced by the leaf pages from the section start tothe section end) corresponds as closely as can be found to apredetermined target size.

The intent is to determine the size of each section so as toapproximately fill the target size in the tablespace. Section sizes mayvary depending on the characteristics of the clustering index and thelevel of uniqueness of that index. The size chosen for the predeterminedtarget size is arbitrary, but should be large enough to set a reasonablenumber of sections for the entire tablespace. The assigned section sizefor each of the sections should be large enough that the space allocatedcan be managed so there is no “spill” into the next section, but smallenough so that there is significant parallel processing. Note that allthe processing for a section is independent from any other sections aslong as all the data fits into the assigned section size.

For example, for a one terabyte tablespace, if the predetermined targetsize is 16 G, then the reorg process will try to assign sections thatwill fill approximately 16 G of tablespace. Assuming that such sectionscan be assigned, the tablespace will have 64 16G sections, which willallow the reorg of the tablespace to be split into as many as 64independent tasks which may be processed in parallel.

It should be noted that in addition to defining a section, an estimatefor the amount of storage space for the section is also produced at 512.The section has been defined to use an area of the output tablespacethat is approximately the predetermined target size (e.g., 16 G in theexample). This area will include all data and free pages and anyencompassing space map pages. The size of the section may be larger,depending on occurrence of very long strings of duplicate clusteringindex keys. All rows for a duplicate key must be included in the samesection.

At 514, once the size of a section is estimated, storage space (e.g., arange of cylinders) for the output tablespace can be allocated.

At 516, the size of the output clustering index for a section isestimated. This estimate may be based on the ratio of tablespace toindex pages in the input objects. The output index section will includeall new leaf pages, tree pages specific to this section, and space mappages that encompass any pages in the section.

A number of factors can contribute to an error in the estimated storagespace. For example, in the DB/2 database management system, index keyextension via an ALTER statement, or changes to PCTFREE (a parameterused to set a portion of each page that should be left empty) orFREEPAGE (a parameter used to specify that an empty page should be leftevery n pages) could cause errors in the estimated output size. The newPCTFREE and FREEPAGE have been included in the estimation of the outputsize, but the original values were not available, so there is no way ofknowing the size of the change in these values.

At 518, once the size of the storage space for the output index sectionis estimated, storage space (e.g., a range of cylinders) for the outputindex section can be allocated.

FIG. 7 shows a more detailed process 700 of section scheduling, and ofhandling output tablespace and index sections that do not fit into theirpre-allocated output space. As noted above, each section isself-contained, with the exception of Space Map pages that cover pagesthat range across sections and the high-level tree pages of the index.Each section can, therefore, be scheduled independently across multipletasks, which may execute in parallel on multiple processors orcomputers, such as parallel, clustered, or distributed systems. Forexample, the tasks handling the sections could be distributed amongmembers of a SYSPLEX under the Z/OS operating system.

It will be understood that in some embodiments, a separate scheduler orgovernor process may be used to handle scheduling. By using a singlescheduler process to schedule processing of the sections, raceconditions, such as multiple sections being processed in parallelallocating and writing the same physical tracks on a direct-accessstorage device may be prevented. Alternatively, other known mechanismsfor avoiding such race conditions, such as semaphores or otherserialization mechanisms, could be used.

At 702, processing of sections is scheduled (e.g., by a scheduler task,as discussed above), and sections are distributed among processors orcomputers for concurrent or parallel processing. As noted above,sections are scheduled to reduce fragmentation in the cases where thepre-assigned output space (e.g., cylinder ranges) is insufficient tohold the entire section. In accordance with various embodiments, thismay be accomplished by scheduling the sections such that, except for thefinal scheduling pass, no section is scheduled unless the next physicalsection has not yet been processed. This means that section outputprocessing can overflow into the next section's pre-assigned area.Additionally, by scheduling in this manner, race conditions aregenerally avoided (even without use of a separate scheduler task),except possibly during the final scheduling pass (or “round”—see below).

In accordance with some embodiments, reducing the effect ofmiscalculating the output sizes of sections for the index and thetablespace may be accomplished by scheduling the sections in rounds.Using the integer value of the square root of the total number ofsections as the interval between sections scheduled in any roundseparates the processing of sections by a size large enough tocompensate for varying section sizes. So, for example, a one terabytetablespace could have 64 16 G sections (though the number and size mayvary, depending on the size estimates for each section), so the intervalbetween sections would be 8. So, sections 1, 9, 17, 25, 33, 41, 49, 57would be scheduled on the first round, then sections 2, 10, 18, etc. onthe second round, and so on. As each section completes, the actual sizeused is returned to the scheduler to adjust the starting position of thenext physical section.

Using this method for scheduling processing of the sections, there maystill be a problem with “fitting” the last round of sections into theallotted space, but the problem has been limited to a fraction of thesections—those in the last round. Also, the process of adjusting thesection starting points dynamically may have the effect of smoothing outvariations in tablespace row lengths in various sections.

In accordance with some embodiments, at sub-process 503, packets mayalso be scheduled, e.g., by a packet scheduler process, which may be apart of the scheduler process described above. The packet schedulerprocess will schedule packets according to the amount of availablememory for processing a section. For each section, numerous (typicallyabout four or five) packets may be scheduled to handle the serial workrelated to writing a section, while additional packets may beconcurrently scheduled to handle input and row processing work in asection, which, in many cases, will be substantial. This includespossible decompression, compression, discard processing, and rowrebuilding due to added or dropped columns.

At 704 the leaf pages in each section are processed. This involvesreading a leaf page that is in the section, then reading the tablespacerows that are pointed to by the leaf page in logical order (which maynot correspond to their physical order in the tablespace—hence the needfor a reorg), writing the tablespace rows into the allocated space forthe output tablespace section such that the physical order of the rowscorresponds to the logical order, and preparing new entries in one ormore output leaf pages for the output index. As output index leaf pagesare completed, entries in output tree pages pointing to the output leafpages may also be prepared.

Additionally, in leaf pages in an index of the DB2 database managementsystem, there may be special requirements for the output index thataffect section processing. In indexes in the DB2 database managementsystem, each leaf page must contain the page number of the prior and thenext logical leaf page. To accomplish this across sections, before anyreading of the leaf pages, the page number of the last leaf page of asection must be assigned and stored in the first leaf page in the nextsection. Note that the page numbers for the previous and next leaf pageare actual physical page numbers, not relative numbers, making itpossible to solve the problem of a large underestimate of the number ofleaf pages in a section. A large number of new output leaf pages arebuilt and saved in memory before writing begins. If the section limit isimminent or has been exceeded, then a search is made for room fromanother section for the excess pages. The vast majority of the time, allthe remaining pages for the index for this section will be in memory sothe exact size required is known, but provision should be made for theexceptional cases.

Additionally, in the DB2 database management system, there must becompensation for the possible occurrence of a space map page. Thesespace map pages occur at specific physical page intervals in everyindex, so it is generally possible to determine if one will be at thelast page of the section.

Generally, the first and last leaf pages in a section are special cases,since, as explained above, the last leaf page in a section may be thesame as the first leaf page in the next section. In such a case, onesection ends, and the next begins at the first new key (i.e., the firstkey that is not continued from the previous leaf page) in the leaf page.

To handle the special cases for the first and last leaf pages in asection, in the DB2 database management system, a flag in the leafpages, referred to as the NORIDP flag is used. The NORIDP flag normallyindicates that no RID (record identifier for a row) follows the previousleaf page high key entry—i.e., the first key in the page is a new key.

If the first leaf page in the section has the NORIDP flag on, this meansthat the first key in this page is not a continuation of the last key inthe prior leaf page, so section processing begins with the first key inthe leaf page. If the first leaf page in the section does not have theNORIDP flag on, this means that the first key in the leaf page is acontinuation of the last key in the prior leaf page, so sectionprocessing beings with the second key in the page. Note that in thiscase, since sections begin on pages where there is a changed key (see,e.g., the process described with reference to FIG. 5), there shouldalways be a second key in the page. If there is no second key in thepage, then the tree page assignment routine is flawed, and there is alogic error that may terminate processing.

If the last leaf page in a section has the NORIDP flag on, then, asabove, the first key in this page is not a continuation of the last keyin the prior leaf page, so section processing ends without anyprocessing of this leaf page. If the last leaf page in the section doesnot have the NORIDP flag on, then the first key in this page is acontinuation of the key in the prior leaf page, so the first key isprocessed and then that ends section processing.

At 706, it is determined whether the newly built output tablespace rowsfit within the pre-allocated output tablespace. If yes, then at 708, ifthere is any remaining output space, the remaining space will beformatted as empty tablespace pages. Empty space (e.g., empty cylinders)will be saved in an “overflow” bucket list. Generally, the “overflow”bucket list needs to be serialized to avoid race conditions, and (insome embodiments, as discussed above) will be handled in the separatescheduler process. For example, in embodiments using the DB2 databasemanagement system running on the Z/OS operating system, areas of theoverflow bucket list will be assigned by a single locked processoracross all the concurrent section tasks, whether in the local CPU or ona remote SYSPLEX CPU. Therefore, in such embodiments, there will be anentire facility that included cross-memory WAIT and POST and SYSPLEXcommunication.

At 710, if the newly built tablespace rows will not fit in the assignedoutput tablespace, then adjustments to the storage of the rows can beused to attempt to fit the rows into the pre-allocated output tablespacefor the section. In the DB2 database management system, for example, thePCTFREE and FREEPAGE values may be adjusted to attempt to fit the datainto the assigned space. If these dynamic adjustments fail, then thereorg process may continue writing the rows over the next section area.No section is scheduled until the immediately prior section hascompleted, except for the final round of processing (see above), so itwill generally be safe to take some space from the space that waspre-allocated for the next section. During the final round ofprocessing, the section information table includes a flag indicating thespace that was pre-allocated for the next section is not available. Thereorg process may then signal to the scheduler that additional space isrequired, so the nearest “overflow” area can be assigned. Whenprocessing the section is complete, the used range of storage space(e.g., the used range of cylinders) and the free range of space arereturned to the scheduler along with a completion indicator.

Incomplete space map pages are written to a small work file so they canbe read by the “wrap-up” process. The “wrap-up” process is generallyresponsible for resolution of the space map pages for the outputtablespace.

If the newly built index pages do not fill the pre-allocated outputcylinders, the remaining pages will be formatted as empty index pages.Empty cylinders will be saved in the “overflow” bucket list.

At 712, it is determined whether the newly built index pages will fitinto the assigned output index space. If yes, then at 714, any remainingpages will be formatted as empty index pages. Empty space (e.g., emptycylinders) will be saved in the “overflow” bucket list.

At 716, if the newly built index pages will not fit in the assignedoutput space, the reorg process may continue writing over the spaceallocated for the next index section. As with tablespace, this isgenerally possible because no section is scheduled until the priorsection has completed, except for the final round. During the finalround, the section information table includes a flag indicating the nextsection is not available. The reorg process will then signal to thescheduler that additional space is required so the nearest “overflow”area can be assigned.

When the section is complete, the used range of output index space(e.g., the used range of cylinders) and the free range of space arereturned to the scheduler along with a completion indicator.

Incomplete space map pages and highest level tree pages (i.e., the treepages above, e.g., level one) are written to a small work file so theycan be read by the “wrap-up” process. The “wrap-up” process handlesresolution of the space map pages for the output index and the highestlevel tree pages.

FIG. 8 shows a computing environment 800 in which the reorg process ofthe various embodiments may be used. Because it is able to efficientlysplit reorg processing into independent sections, and schedule thesections for concurrent processing, the reorg process of the disclosureimproves the functioning of the computing environment during the reorgprocess. While generic components are shown, it will be understood thatthe computing environment 800 could be, for example, a SYSPLEX orparallel SYSPLEX operating under the Z/OS operating system.

The computing environment 800 includes computers 802 a-802 c, whichcould be, for example, mainframe computers manufactured by InternationalBusiness Machines Corporation, interconnected by a channel 804, whichmay be, e.g., a high-speed link or network. Each of the computers 802a-802 c includes one or more processors 806 a-806 c, memory 808 a-808 c,optional local storage 810 a-810 c, and communication circuitry 812a-812 c, which handles transferring data over the channel 804. It willbe understood that within the computers, a variety of architectures maybe used. For example, some of the memory of each computer may be sharedbetween processors in each computer, while some of the memory of eachcomputer may be local to each of the processors in each computer. Itwill further be understood that although the channel 804 is shown as asingle element in the drawing, there may be various components thathandle a high-speed interlink between the computers 802 a-802 c. Forexample, in some systems, specialized coupling devices (not shown) maybe used to control high-speed communication traffic between thecomputers 802 a-802 c.

The computing environment 800 may also include a shared storage facility814 which is connected to the channel 804 to provide shared data storageto the computers 802 a-802 c. As will be understood, there may beadditional equipment involved in providing the shared storage facility814. For example, a device (not shown) that determines whether incomingrequests to the shared storage facility need to be serialized to avoidconflicts may be used in association with the shared storage facility.

There are, of course, many other computing environments with whichembodiments of the disclosure could be used, and the computingenvironment 800 is used only for purposes of illustration. Applying thereorg methods of the disclosure by improve the functioning of manycomputing systems and environments, such as parallel, clustered, anddistributed computing systems, during a database reorg operation.

There is thus disclosed a system of one or more computers that can beconfigured to perform particular operations or actions by virtue ofhaving software, firmware, hardware, or a combination of them installedon the system that in operation causes or cause the system to performthe actions. One or more computer programs can be configured to performparticular operations or actions by virtue of including instructionsthat, when executed by data processing apparatus, cause the apparatus toperform the actions.

One general aspect includes a method of reorganizing a tablespace in adatabase such that rows of the tablespace are arranged in a sequencedefined in a balanced tree-type clustering index of the tablespace, themethod including: sectioning, by a processor, the clustering index andthe tablespace into sections including logically distinct sets of databy reading only tree pages of the clustering index to determine logicaldivisions; allocating, by a processor, an amount of output space on astorage device for each section of the tablespace and of the clusteringindex, to provide for each section a first range of storage space for anoutput clustering index for the section, and a second range of storagespace for an output tablespace for the section; scheduling, by aprocessor, a reorg task for each section; and executing, by at least oneprocessor, the scheduled reorg tasks on the sections. Other embodimentsof this aspect include corresponding computer systems, apparatus, andcomputer programs recorded on one or more computer storage devices, eachconfigured to perform the actions of the methods.

Implementations may include one or more of the following features. Themethod where executing, by at least one processor, the scheduled reorgtasks on the sections includes, for each section: reading a leaf page inthe section; reading the tablespace rows that are pointed to by the leafpage in logical order; writing the tablespace rows into the second rangeof storage space for the output tablespace, such that a physical orderof the rows in the second range of storage space corresponds to thelogical order; and preparing new entries in one or more output leafpages for the output clustering index. The method where writing thetablespace rows into the second range of storage space for the outputtablespace includes writing into storage space allocated for a nextsegment when the second range of storage space is insufficient tocontain the output tablespace. The method where executing, by at leastone processor, the scheduled reorg tasks includes processing a pluralityof the reorg tasks in parallel. The method where processing a pluralityof the reorg tasks in parallel includes executing the plurality of thereorg tasks on a computing system selected from a set including aparallel computing system, a clustered computing system, and adistributed computing system. The method where executing, by at leastone processor, the scheduled reorg tasks includes processing thescheduled reorg tasks on a SYSPLEX under a version of a Z/OS operatingsystem. The method where sectioning, by a processor, the clusteringindex and the tablespace into sections includes assembling a new keybitmap, based on information in level 2 tree pages of the clusteringindex. The method where assembling the new key bitmap includes assigninga bit for each leaf page pointed to by the level 2 tree pages of theclustering index, such that the bit is on if it is known that a new keystarts in the leaf page corresponding to the bit, and the bit is off ifeither it is known that there is not a new key in the leaf pagecorresponding to the bit or if the decision is uncertain. The methodwhere sectioning, by a processor, the clustering index and thetablespace into sections further includes defining a section such thatthe section starts on a leaf page for which the corresponding bit in thenew key bitmap is on, and ends on a leaf page for which thecorresponding bit in the new key bitmap is on. The method wheresectioning, by a processor, the clustering index and the tablespace intosections includes selecting start and end leaf pages for a section suchthat a size of the tablespace section is approximately a predeterminedtarget size. The method where scheduling, by a processor, a reorg taskfor each section includes scheduling the reorg task for each sectionsuch that, except for during a final scheduling pass, the reorg task fora section is scheduled only if a next physical section has not yet beenprocessed. The method where scheduling the reorg task for each sectionincludes scheduling the reorg tasks in rounds, such that an intervalbetween sections for which reorg tasks are scheduled in a roundseparates the sections by a size that compensates for varying sectionsizes. The method where the interval between sections for which reorgtasks are scheduled in a round is the integer value of the square rootof a total number of sections. The method where scheduling, by aprocessor, a reorg task for each section includes scheduling a pluralityof packets to run concurrently within each section. The method wherescheduling a plurality of packets to run concurrently within eachsection includes scheduling at least two packets of the plurality ofpackets to finish serially. The method where scheduling a plurality ofpackets to run concurrently within each section includes scheduling atleast one packet of the plurality of packets to handle writing data forthe section, and scheduling at least one packet of the plurality ofpackets to concurrently handle row processing work within the section.The method where scheduling at least one packet of the plurality ofpackets to concurrently handle row processing work within the sectionincludes scheduling a packet to handle row processing work selected froma set including of decompression, compression, discard processing, androw rebuilding. The method where scheduling a plurality of packets torun concurrently within each section includes scheduling a number ofpackets within a section according to a memory limitation. The methodfurther including executing a wrap-up process to resolve space map pagesfor the output tablespace and for the output clustering index, and towrite the highest level tree pages. The method where the databaseincludes a version of a DB2 database management system. Implementationsof the described techniques may include hardware, a method or process,or computer software on a computer-accessible medium.

One general aspect includes a system including: a storage facilitystoring a tablespace of a database and a balanced tree-type clusteringindex of the tablespace, the storage facility connected to a channel andconfigured to transfer data over the channel; and a computer includingat least one processor, a memory, and communication circuitry connectedto the channel and configured to transfer data over the channel. Thecomputer is configured to reorganize the tablespace such that rows ofthe tablespace are arranged in a sequence defined in the clusteringindex of the tablespace by executing instructions that cause the atleast one processor to: section the clustering index and the tablespaceinto sections including logically distinct sets of data by reading onlytree pages of the clustering index to determine logical divisions;allocate an amount of output space on the storage facility for eachsection of the tablespace and of the clustering index, to provide foreach section a first range of storage space for an output clusteringindex for the section, and a second range of storage space for an outputtablespace for the section; schedule a reorg task for each section; andexecute the scheduled reorg tasks on the sections.

Implementations may include one or more of the following features. Thesystem where the computer is configured to execute the scheduled reorgtasks on the sections by executing instructions that cause the at leastone processor to: read a leaf page in the section from the storagefacility; read the tablespace rows that are pointed to by the leaf pagein logical order from the storage facility; write the tablespace rowsinto the second range of storage space for the output tablespace, suchthat the physical order of the rows in the second range of storage spacecorresponds to the logical order; and prepare new entries in one or moreoutput leaf pages for the output clustering index. The system where thecomputer includes a plurality of processors, and where the computer isconfigured to execute the scheduled reorg tasks by processing aplurality of the reorg tasks in parallel on the plurality of processors.The system further including at least one additional computer connectedto the channel and configured to transfer data over the channel, andwhere the computer is configured to execute the scheduled reorg tasks byexecuting a first reorg task on the computer and a second reorg task onthe additional computer. The system where the computer is configured toschedule a reorg task for each section by executing instructions thatcause the processor to schedule a plurality of packets to runconcurrently within each section. The system where the system isconfigured as a SYSPLEX operating under a version of a Z/OS operatingsystem. The system where the database includes a version of a DB2database management system. Implementations of the described techniquesmay include hardware, a method or process, or computer software on acomputer-accessible medium.

Various other components and processes may be included and called uponfor providing for aspects of the teachings herein. For example,additional materials, processes, combinations of materials, combinationsof processes, and/or omission of materials or processes may be used toprovide for added embodiments that are within the scope of the teachingsherein.

Standards for performance, selection of materials, functionality, andother discretionary aspects are to be determined by a user, designer,manufacturer, or other similarly interested party. Any standardsexpressed herein are merely illustrative and are not limiting of theteachings herein.

When introducing elements of the present invention or the embodiment(s)thereof, the articles “a,” “an,” and “the” are intended to mean thatthere are one or more of the elements. Similarly, the adjective“another,” when used to introduce an element, is intended to mean one ormore elements. The terms “including” and “having” are intended to beinclusive such that there may be additional elements other than thelisted elements.

While the invention has been described with reference to illustrativeembodiments, it will be understood by those skilled in the art thatvarious changes may be made and equivalents may be substituted forelements thereof without departing from the scope of the invention.Although the title of the invention is “REORGANIZATION OF DATABASES BYSECTIONING”, this title is not intended to be limiting, and insteadrefers to particular examples described herein. Similarly, the field ofthe invention and description of related art are not intended to belimiting. As discussed above, while examples relating to the DB2database management system running under the Z/OS operating system arediscussed, it will be understood that the various teachings providedherein could also be applied to other database systems having similarclustering indexes and running under different operating systems or indifferent operating environments. In addition, many modifications willbe appreciated by those skilled in the art to adapt a particularinstrument, situation, process, or material to the teachings of theinvention without departing from the essential scope thereof. Therefore,it is intended that the invention not be limited to the particularembodiment disclosed as the best mode contemplated for carrying out thisinvention, but that the invention will include all embodiments fallingwithin the scope of the appended claims.

What is claimed is:
 1. A method of reorganizing a tablespace in adatabase such that rows of the tablespace are arranged in a sequencedefined in a clustering index of the tablespace, the method comprising:sectioning the clustering index and the tablespace into sectionscomprising logically distinct sets of data; allocating output space on astorage device for each section of the tablespace and of the clusteringindex, to provide for each section a first range of storage space for anoutput clustering index for the section, and a second range of storagespace for an output tablespace for the section; scheduling a reorg taskfor each section; and. executing the scheduled reorg tasks on thesections.
 2. The method of claim 1, wherein executing the scheduledreorg tasks on the sections comprises, for each section: reading thetablespace rows in the section; writing the tablespace rows into thesecond range of storage space for the output tablespace, such that aphysical order of the rows in the second range of storage spacecorresponds to a logical order; and preparing new entries in the outputclustering index.
 3. The method of claim 1, wherein executing thescheduled reorg tasks comprises processing a plurality of the reorgtasks in parallel.
 4. A system comprising: a storage facility storing atablespace of a database and a clustering index of the tablespace, thestorage facility connected to a channel and configured to transfer dataover the channel; and a computer comprising at least one processor, amemory, and communication circuitry connected to the channel andconfigured to transfer data over the channel, wherein the computer isconfigured to reorganize the tablespace such that rows of the tablespaceare arranged in a sequence defined in the clustering index by executinginstructions that cause the at least one processor to: section theclustering index and the tablespace into sections comprising logicallydistinct sets of data; allocate an amount of output space on the storagefacility for each section of the tablespace and of the clustering index,to provide for each section a first range of storage space for an outputclustering index for the section, and a second range of storage spacefor an output tablespace for the section; schedule a reorg task for eachsection; and execute the scheduled reorg tasks on the sections.
 5. Thesystem of claim 4, wherein the computer comprises a plurality ofprocessors, and wherein the computer is configured to execute thescheduled reorg tasks by processing a plurality of the reorg tasks inparallel on the plurality of processors.